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INTRODUCTION 

A valuable source of three-dimensional (3-D) information is provided by the rel- 
ative motions of elements in the changing two-dimensional (2-D) image. The remark- 
able ability of the human visual system to recover 3-D structure from motion was 
explored in many early perceptual studies (for example, Wallach & O'Connell, 1953; 
Gibson & Gibson, 1957; White & Mueser, 1960; Green, 1961; Braunstein, 1976; Jo- 
hansson, 1973, 1978; Rogers & Graham, 1979; Ullman, 1979). These studies reveal 
that the human system can recover the structure of rigid and nonrigid objects, un- 
der perspective and orthographic projection, and in the absence of all other cues to 
3-D structure. Early perceptual work typically focused on the recovery of qualitative 
aspects of an object's structure, such as its apparent rigidity, volume or coherence. 
More recently, studies have addressed quantitatively, the accuracy of perceived 3-D 
structure (for example, Todd, 1982, 1984, 1985; Lappin & Fuqua, 1983; Braunstein et 
al., 1987; Loomis & Eby, 1988, 1989; Dosher et al., 1989; Sperling et al., 1989). 

This paper presents a set of psychophysical experiments that examine both the 
accuracy of the 3-D model computed by the human visual system and the time course 
of the buildup of perceived structure. The experiments are motivated in part by 
a computational model proposed by Ullman (1984), called the incremental rigidity 
scheme, in which an accurate 3-D structure is built up incrementally, by considering 
images of moving objects over an extended time period. If this model captures some 
aspects of the human recovery of structure from motion, then two critical predictions 
arise. First, the accuracy of perceived structure should increase over an extended 
time period, and second, the current perception of structure should strongly influence 
later perceptions. Also, in contrast to previous structure-from-motion models, the 
incremental rigidity scheme exhibits good performance in the presence of noise in the 
image motion measurements, due in part to the integration of motion information 
over time. This observation raises the question of how well the human visual system 
performs in the presence of image noise. Through psychophysical experiments, we 
examine these three questions. Through computer simulations, we then relate the 
results of our psychophysical experiments with the predictions of Ullman's model. 

The next section briefly reviews previous computational and perceptual studies of 
the recovery of structure from motion and introduces the model proposed by Ullman 
(1984). We then present a series of experiments in which subjects perform the sim- 
ple task of ordering a set of moving points in depth. Through these experiments, we 
examine the accuracy of perceived structure, the nature of its buildup over time and 
its sensitivity to noise in the visual image. We conclude that when viewing displays 
containing as few as three points undergoing relative motion, the human visual system 
can derive quite an accurate model of the relative depths of the points, even in the 
presence of noise. The accuracy of the 3-D model improves with time; some observers 
show continued improvement up to about one second of viewing. Performance eventu- 



ally reaches a plateau, beyond which there is no further improvement. After presenting 
the experimental results, we describe a set of computer simulations that reveal that 
the early time course of the buildup of perceived 3-D structure is similar to that pre- 
dicted by one formulation of Ullman's model. These experiments also provide some 
evidence that the 3-D structure currently perceived depends on previous 3-D models. 
The implications of our observations for the computation of structure from motion in 
the human visual system are addressed in our final discussion. 

THE COMPUTATION OF 3-D STRUCTURE FROM MOTION 

In studying the computation of structure from motion, one immediately faces the 
problem that the recovery of structure is underconstrained; there are infinitely many 
3-D structures consistent with a given pattern of motion in the changing 2-D image. 
Additional constraint is required to establish a unique interpretation. Early perceptual 
studies suggest that the presumed rigidity of objects may play a key role in the re- 
covery of structure from motion (Wallach & O'Connell, 1953; Gibson & Gibson, 1957; 
Green, 1961; Jansson & Johansson, 1973; Johansson, 1973, 1977). Computational 
studies establish that rigidity is a sufficiently powerful constraint to derive a unique 
interpretation of structure under a variety of viewing conditions. 

From theoretical studies, it can be concluded that by exploiting a rigidity con- 
straint, a unique 3-D structure can be recovered from motion information alone, using 
image measurements that are integrated over a small extent in space and in time (for 
example, Ullman, 1979, 1983; Clocksin, 1980; Longuet-Higgins & Prazdny, 1980; Tsai 
& Huang, 1981; Hoffman, 1982; Prazdny, 1983; Kanatani, 1985; Waxman & Ullman, 
1985; Mitiche, 1986; Waxman & Wohn, 1988). A review of many of these results can 
be found in Ullman (1983), Barron (1984) and Hildreth and Koch (1987). Theoret- 
ical studies have also given rise to algorithms for deriving the rigid 3-D structure of 
moving objects. Experimentation with these algorithms has revealed two important 
limitations. First, although it is possible in theory to recover structure from motion 
information that is integrated over a small extent in space and time, such a strategy 
may not be robust in practice (an error analysis for the case of limited temporal extent 
can, for example, be found in Weng, Huang and Ahuja (1989)). A small amount of 
error in the image measurements can lead to very different solutions (Ullman, 1983). 
Second, most previous algorithms derive a 3-D structure only when a rigid interpre- 
tation is possible, and otherwise do not yield any interpretation of structure or yield 
a solution that is incorrect or unstable. 

Theoretical studies suggest that a robust algorithm for recovering structure should 
use motion information that is more extended in space or time. This conclusion is 
supported in recent computational studies (for example, Bruss & Horn, 1983; Lawton, 
1983; Ullman, 1984; Adiv, 1985; Bolles & Baker, 1985; Negahdaripour & Horn, 1985; 
Yasumoto & Medioni, 1985; Bharwani et al., 1986; Shariat & Price, 1986; Subbarao, 



1986; Landy, 1987; Waxman k Wohn, 1988; Bhanu k Burger, 1988). 

With regard to the human visual system, the dependence of perceived structure 
on the spatial and temporal extent of the viewed motion has not yet been studied 
systematically, but the following informal observations have been made. Regarding 
spatial extent, two or three points undergoing relative motion are sufficient to elicit a 
perception of 3-D structure (Borjesson k von Hofsten, 1973; Lappin k Fuqua, 1983; 
Braunstein et al., 1987; Petersik, 1987), although theoretically the recovery of struc- 
ture is less constrained for two points in motion, and perceptually the sensation of 
structure is weaker. An increase in the number of moving elements in view can yield 
a more compelling sense of 3-D shape (Todd et al., 1988; Dosher et al., 1989; Sperling 
et al., 1989), but its influence on the accuracy of perceived structure is unclear (see, 
for example, Petersik, 1980; Braunstein et al., 1987). Regarding the temporal extent 
of viewed motion, Johansson (1973) showed that a brief observation of patterns of 
moving lights generated by human figures moving in the dark (commonly referred to 
as biological motion displays) can lead to a perception of the 3-D motion and struc- 
ture of the figures. Other perceptual studies indicate that the human visual system 
requires an extended time period to reach an accurate perception of 3-D structure 
(Wallach k O'Connell, 1953; White k Mueser, 1960; Braunstein k Andersen, 1984b; 
Doner, Lappin k Perfetto, 1984; Braunstein et al., 1987; Siegel k Andersen, 1988; 
Husain, Treue k Andersen, 1989). A brief observation of a moving pattern sometimes 
yields an impression of structure that is flatter than the true structure of the moving 
object. Thus, the human visual system can derive some sense of structure from mo- 
tion information that is integrated over a small extent in space and time. An accurate 
perception of structure, however, may require a more extended viewing period. 

The sensitivity of early structure-from-motion algorithms to error in the image 
motion measurements raises the question of how sensitive is the human recovery of 
structure to image noise. Lappin, Doner and Kottas (1980) showed that small amounts 
of noise could disrupt subjects' ability to discriminate between different amounts of 
coherence in structure-from-motion displays. This study used only two frames in 
alternation, however. Other experiments have shown that subjects can tolerate larger 
amounts of noise when extended sequences of images are used (Petersik, 1979; Doner, 
Lappin k Perfetto, 1984; Todd, 1984, 1985; Husain, Treue k Andersen, 1989). Todd's 
studies, in particular, show that subjects can make an accurate assessment of 3-D 
shape and motion in the presence of large amounts of visual noise. 

Although most algorithms for recovering structure from motion are unable to 
interpret nonrigid motions, there are exceptions to this that can interpret restricted 
classes of nonrigid motions (for example, Rashid, 1980; Hoffman k Flinchbaugh, 1982; 
Bennett k Hoffman, 1985; Koenderink k Van Doom, 1986; Subbarao, 1986). The 
mechanism for recovering structure from motion in the human visual system appears 
not to be based strictly on the rigidity assumption. It is an everyday experience to 



perceive the structure and motion of deforming objects such as a flowing river, an 
expanding balloon, or a dancing ballerina. Such experiences are rich with many cues 
to 3-D structure. In controlled perceptual studies that isolate relative movement as a 
single cue to 3-D structure, however, it also appears that the human visual system can 
derive some sense of structure for a broad range of nonrigid motions, including stretch- 
ing, bending and even more complex types of deformations (for example, Johansson, 
1973, 1978; Jansson & Johansson, 1973; Cutting, 1982; Todd, 1982, 1984, 1985; Loomis 
& Eby, 1988, 1989). Furthermore, displays of rigid objects in motion sometimes give 
rise to the perception of somewhat distorting objects (Wallach, Weisz & Adams, 1956; 
White & Mueser, 1960; Braunstein, 1976; Sperling et al., 1989; Schwartz & Sperling, 
1983; Braunstein & Andersen, 1984a; Adelson, 1985; Loomis & Eby, 1988, 1989). 

Recently, Ullman (1984) proposed a more flexible method for deriving the struc- 
ture of rigid and nonrigid objects that provides a natural means for integrating motion 
information over an extended time period. This method makes use of the rigidity as- 
sumption, but in a more flexible way than previous studies. The algorithm, called the 
incremental rigidity scheme, maintains an internal model of the structure of a moving 
object, which is continually updated as new positions of image elements are considered. 
The initial model may be flat, if no other cues to 3-D structure are present, or it may 
be determined by other cues available, for example, from binocular stereopsis, shading, 
texture or perspective. As each new view of the moving object appears, the algorithm 
computes new 3-D coordinates for points on the object, which maximize the rigidity 
in the tranformation from the current model to the new positions. In particular, the 
algorithm minimizes the change in the 3-D distances between points in the model. 
The formulation presented by Ullman assumes the input to the recovery process to 
consist of a sequence of discrete frames, each containing a set of discrete feature points 
whose positions are obtained by orthographic projection of the scene onto the image 
plane. Through the process of repeatedly considering a new frame in the sequence and 
updating the current model of the structure of the moving features, the incremental 
rigidity scheme builds up and maintains a 3-D model, and can be applied both to rigid 
and nonrigid objects in motion. Recent extensions to the incremental rigidity scheme 
use velocity information directly as input to the recovery of structure from motion, 
and perspective projection (Grzywacz & Hildreth, 1987; for further details, see also 
Grzywacz & Hildreth, 1985). Landy (1987) presents a parallel structure-from-motion 
model that implements a similar scheme in a cooperative network. Details of three dif- 
ferent formulations of the incremental rigidity scheme addressed in this paper appear 
in the section on computer simulations. Other models have been suggested that impose 
rigidity by requiring that the 3-D distances between points in space change very little 
from one moment to the next (for example, Mitiche, 1986; Subbarao, 1986; Jasinschi 
& Yuille, 1989), although these models do not build up 3-D structure incrementally 
as in Ullman's proposed scheme. 
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The features of the incremental rigidity scheme that distinguish it from other 
models are the buildup of an accurate 3-D model over an extended time and the use 
of a current 3-D model as an explicit source of constraint on the model computed at 
the next moment. In most other structure-from-motion models, the computed 3-D 
structure at each moment is constrained only by direct visual input that is integrated 
over a small window in time. The next sections explore whether the recovery of 
structure from motion in the human visual system exhibits these salient properties. 

GENERAL METHODS 

We first describe aspects of the visual stimuli and experimental procedure that 
are common to the entire set of experiments. The experimental design was guided 
by a number of considerations. First, the task relies on an objective judgement that 
does not require the observer to form an internal subjective scale of properties such 
as 3-D distance or amount of rigidity. Second, the subject is able to perform the 
task in a sufficiently short time period that we can measure the time course of the 
early buildup of 3-D structure. If, on the other hand, it only required one or two 
seconds for the visual system to compute an accurate model of 3-D structure, but 
several seconds of observation time were needed for the subject to make a judgement, 
then we could only assess the accuracy of the final computed 3-D model. Such a task 
would not allow us to explore the intermediate structures perceived in the first one or 
two seconds of viewing time. Third, we have designed a task that relies as much as 
possible on the derived 3-D structure of the moving elements, rather than their raw 
2-D positions or velocities in the display. Because the recovery of 3-D structure from 
motion necessarily relies on properties of the changing 2-D projection, one cannot 
guarantee that observer's judgements are not based directly on 2-D information, but 
we chose an experimental design that makes it very difficult for the observer to use 
2-D cues directly. Finally, the experiments make use of a quantitative judgement that 
requires only relative movement as the source of 3-D information. 

Subjects 

The authors, who are all trained psychophysical observers, served as the subjects 
for these experiments. 

The Visual Stimuli 

A set of three points distributed in space was rotated around a central axis and 
projected onto a 2-D computer display, using orthographic projection. To describe the 
stimuli in more detail, let us assume a coordinate system in which the X and Y axes 
are the horizontal and vertical axes in space and in the image plane, which are the 
same under orthographic projection, and the positive Z axis is directed perpendicular 
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Figure 1. The ordinal experiment, (a) Side view of the experimental setup, 
indicating the slanted axis of rotation, (b) Projection of a typical configuration 
of points onto a plane perpendicular to the axis of rotation. The circular outlines 
indicate the annulus within which the points are located, (c) Bird's eye view of the 
final positions of the three points, indicating their separation in depth, 7. 



to the image plane, away from the viewer. For the first experiment, the axis of rotation 
of the three points was slanted 10° away from the image plane, as shown in Figure 
la. For later experiments, the axis of rotation was parallel to the image plane. When 
projected onto a plane perpendicular to the axis of rotation, the positions of the three 
points always lie within an annulus, as indicated in Figure lb. The outer boundary of 
the annulus restricts the overall range of X and Z coordinates of the moving points. 
The reason for restricting the points to lie outside the inner boundary of the annulus 
is that points located near the center of this circular projection would move very 
little under rotation. A point that is moving by only a small amount could easily be 
identified as lying near the center of the cylindrical volume encompassing the set of 
three points. The use of an annulus as shown in Figure lb removes this potential cue 
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to 3-D structure. 

The particular configurations of points were chosen such that for any given rota- 
tion, the positions of the points were evenly spaced in depth in the final frame that 
was viewed. Let 7 denote the displacements in depth for this final view, as shown in 
Figure lc. The parameter, 7, is defined in units of picture elements on the display. 
In the plane of the display, which is perpendicular to the observer's line of sight, one 
picture element corresponds to a visual angle of 2.5'. 

The Y coordinates of the three points (vertical positions in the image plane) were 
chosen such that there was always a minimum separation between vertical positions of 
25' of visual arc. The size of each point was 10' x 10' of visual angle and the resolution 
of the display was such that the positions of the points could be set to a resolution 
of 2.5'. The overall size of the window of the computer display in which the points 
appeared was 10° x 10°. 

The display itself was a monochrome video monitor from a Symbolics Lisp Ma- 
chine, with a fast decaying (P4) phosphor. The experiments used black dots on a 
white background, in order to reduce the possible effects of persistence of the display. 

For each trial in the experiment, the points were rotated through a given total 
angular extent in increments of 1.5° around the central axis. A discrete frame was 
created for each angular position of the points and the entire set of frames was displayed 
as a movie. The presentation time for each frame was approximately 33 msec. There 
was no interstimulus interval (ISI) — after the 33 msec presentation time, each frame 
was immediately replaced by the next frame in the sequence. A fixation mark also 
appeared in every frame and the subject was required to fixate on the mark throughout 
the duration of each movie. 

Experimental Procedure 

For each trial, the first frame of the movie appeared on the display, the subject 
pushed a button to indicate that he was ready; the movie was then displayed. The 
distance of the viewer from the display was 0.4 meters. Viewing was monocular and 
in a dark room. 

The subjects were asked to specify which of the three points was located midway 
in depth between the other two. Because orthographic projection is used, there are 
two possible rigid structures corresponding to the changing projections — one is the 
structure used to generate the frames and the second is its reversal in depth. Both 
solutions share the same central point, so the outcome of a given trial is not effected 
by whether the subject sees a given structure or its depth reversal. 

To specify their choice to the computer, subjects were given a box with three 
buttons oriented vertically and were told to associate the vertical ordering of the three 
buttons on the box with the vertical ordering of the positions of the points on the 
display. After each trial, the subject pushed the appropriate button depending on 





whether the top, middle or bottom point was perceived as being between the other 
two points in depth. No feedback regarding the correctness of the response was given. 
This lack of feedback reduced the likelihood for subjects to use simple tricks based 
directly on 2-D cues for performing the task. 

The accuracy of perceived 3-D structure can be assessed by measuring how well 
subjects perform this task as we vary the separation between the points in depth, 7, 
while the time course of the buildup of this accuracy can be assessed by measuring 
subjects' performance as the angular extent of rotation of the points is increased. The 
particular stimulus parameters used are indicated in the discussion of each individual 
experiment. 

EXPERIMENT 1 

This first experiment addresses the time course of the buildup of accuracy of 
perceived 3-D structure, by measuring subjects' performance as the angular extent of 
rotation is varied. 

The Visual Stimuli and Experimental Procedure 

Six different angular extents were used in this experiment: 0°, 3°, 6°, 15°, 30°, 45°. 
A single experimental session consisted of 324 trials with 81 different configurations 
of the three points. For this first experiment, the displacement in depth was 7 = 70. 
Through a single session, the 81 configurations were repeated four times in random 
order. The different angular extents appeared in blocks of 27 trials, with the ordering 
of the blocks randomized. The 0° condition served as a control, to show that there 
were no static cues to depth in these displays. For each experimental condition, we 
computed the percentage of correct responses by the subject. 

The axis of rotation was slanted 10° away from the image plane, as shown in 
Figure la. The use of the slanted rotation axis made it possible to design the 81 
configurations in a way that made it very difficult for subjects to base their judgement 
of relative depth directly on 2-D information about positions or velocities in the image. 
There was no bias for particular positions on the computer display, that is, the point 
corresponding to the correct response was equally likely to appear on the top, middle 
or bottom in the vertical direction, and to the left, right or middle in the horizontal 
direction. Also, there was no bias for particular image velocities, that is, the correct 
point was equally likely to move with the highest, middle or lowest image velocity. 
Consequently, it was not possible for the subjects to base their judgement directly on 
simple properties of the 2-D image positions or velocities of the points. The use of the 
slanted rotation axis was essential for removing the 2-D velocity cue. If the axis of 
rotation were parallel to the image plane, then projected image velocity could provide 
a direct cue to depth. 
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Experimental Results 

For this first experiment, data was gathered for two subjects, ECH and NMG, 
and is displayed in Figure 2. The angular extents of rotation are indicated on the 
vertical axis of the graph shown in Figure 2 and the percentage of correct responses 
appears on the horizontal axis. Each data point represents the result of 324 trials. 
Error bars indicate a single standard error of the mean. A chance level of performance 
corresponds to 33%. Both subjects performed at chance for 0° of rotation. 
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Figure 2. Results of Experiment 1. Data for ECH (squares) and NMG (circles) 
are shown superimposed. The percentage of correct responses is plotted against 
the total angular rotation. Standard errors are displayed as vertical bars. The data 
shows improvement in performance with increased angular extents of motion. 



The first observation that can be made is that subjects did exhibit a buildup in the 
accuracy of the perceived 3-D structure of the points. There was a rise in performance 
level from 0° to 30° of rotation for the subject ECH (squares), while performance for 
the subject NMG (circles) continued to rise up to 45°. Later experiments indicate that 
performance typically reaches a plateau at 30° and 45° of rotation for the subjects ECH 
and NMG, respectively. For the longer angular rotations, subjects found it increasingly 
difficult to maintain the perception of a rigid structure over the full extent of rotation; 
the points sometimes appeared to move independently of one another in a nonrigid 
manner. 

It should be emphasized that the total viewing time and total angular extent 
of viewed motion are directly coupled in these visual displays. We cannot conclude 
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from this experiment alone whether the accuracy of perceived structure depends most 
critically on the extension of the visual stimulus in time, across space, or both. If 
the extent in time alone is a factor in determining this accuracy, we note that the 
rotation of 30° corresponds to a total viewing time of 768 msec, which is in rough 
agreement with the time expected from the studies of Andersen and Siegel (1988; 
Siegel & Andersen, 1988; see also, Husain, Treue & Andersen, 1989) indicating that 
several hundred msec of viewing time is required to make visual discriminations of 
motion that are essential to the detection of 3-D structure. 

EXPERIMENT 2 

The second experiment explores both the temporal buildup of perceived 3-D struc- 
ture and the accuracy of the perceived structure. Accuracy is assessed by measuring 
subjects' performance as the final separation in depth between the points, 7, is varied. 

The Visual Stimuli and Experimental Procedure 

The experimental setup is similar to that of the first experiment, but with an 
important exception. The points were now rotated around a central vertical axis 
(that is, parallel to the image plane). The reason for this was a practical one. In 
the first experiment, there was no bias for particular image velocities, in that the 
point that was midway in depth at the end of the rotation was equally likely to move 
with the slowest, middle or fastest velocity in the image. Because of the nature of 
the geometric projection, for smaller final separations of the points, 7, this uniform 
distribution of velocities could only be achieved if the points were allowed to have 
large vertical separations. The task itself, however, becomes difficult when the points 
are widely separated in the image, because the points appear to decouple from one 
another. We therefore decided to use a vertical axis of rotation in this experiment, 
allowing the configurations to become more compact in the vertical direction. While 
this introduces a potential 2-D velocity cue, we believe that the subjects were not 
using this velocity cue directly, for reasons that are elaborated upon in our discussion 
of the results of this experiment. 

A single experimental session consisted of 256 trials and used 64 different configu- 
rations of the three points and 4 different angular extents of rotation. The displacement 
in depth, 7, was kept constant for a single session, but was now varied between sessions. 
Through a single session, the 64 configurations were repeated four times in random 
order. The different angular extents appeared in blocks of 16 trials, with the ordering 
of the blocks randomized. As a control, the points again remained stationary in some 
trials. The 64 configurations were chosen such that there was no bias for particular 
positions on the computer display (the point corresponding to the correct response was 
equally likely to appear on the top, middle or bottom in the vertical direction, and 
to the left, right or middle in the horizontal direction). A total of 8 different angular 
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extents were used in this experiment: 0°, 6°, 15°, 30°, 45°, 60°, 90°, 177°. In addition, 
5 different depth displacements, 7 were used: 10,20,30,40,50. For each experimental 
condition, we computed the percentage of correct responses by the subject. 

Experimental Results 

Complete data were gathered for three subjects, ECH, VKI and NMG. Partial 
data was also obtained for a fourth subject, which confirmed the general trends seen 
in the data. Individual data for ECH and NMG are displayed in Figures 3a and 3b, 
respectively. The data are shown separately, because quantitative differences between 
the performance of individual subjects were observed. The two sets of data shown here 
represent the range of performance observed. Figure 3c shows the results of averaging 
the data obtained for the three subjects. The angular extents of rotation are indicated 
on the horizontal axis of the graphs shown in Figure 3 and the percentage of correct 
responses appears on the vertical axis. Data for different displacements 7 are drawn 
on separate curves. As a control, we again verified that subjects were at a chance level 
of performance for 0° of rotation. Each data point in Figure 3 represents the result of 
64 trials (that is, 64 different configurations). 

It can again be observed that subjects showed a buildup in the accuracy of the 
perceived 3-D structure of the points. Particularly for the larger values of 7, there 
was a steady rise in performance level from 0° of rotation that reached a plateau at 
about 30° of rotation for subject ECH (Figure 3a) and about 45° of rotation for NMG 
(Figure 3b). After only 6° of rotation, subjects already reached a level of performance 
that was often within about 15% of the level at which performance reached a plateau. 
After about 30° - 45° of rotation, performance generally did not continue to improve 
with larger extents of motion. The level of performance dropped for very large angular 
extents. (When tested against a binomial distribution with probability 0.5 that the 
level of performance was larger at 180° than at 90°, a significant drop was found 
(n = 18, P < 0.05).) Subjects again found that it was difficult to maintain the 
perception of a rigid structure over such a long viewing period; the points sometimes 
appeared to move independently of one another in a nonrigid manner. 

Performance was generally worse for smaller 7. For the largest value of 7 tested, 
7 = 50, performance for some subjects reached the 90% level of performance. For 
the smallest displacement, 7 = 10, subjects were not significantly above chance, while 
for 7 = 20, subjects were well above chance for all angular extents of rotation. This 
suggests that the threshold for discrimination of relative depth from motion may lie 
somewhere in this range of displacements. For the case where 7 = 20, the relative 
depth between the points is 1.45% of the distance between the observer and computer 
display, that is, the threshold was roughly 0.6 cm from a viewing distance of 40 cm. 
Measurements need to be made at additional viewing distances, however, to derive a 
reliable measure of structure-from-motion acuity. 
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Figure 3. Results of Experiment 2. The percentage of correct responses is plotted 
against the total angular rotation. The five different curves in each figure correspond 
to the five displacements in depth, 7 = 10 (circles, lower curve), 20 (diamonds), 
30 (triangles), 40 (squares) and 50 (circles, upper curve), (a) Data for subject 
ECH. (b) Data for subject NMG. (c) Average data for three subjects. Performance 
improves both with angular extent of rotation and with increased displacement in 
depth. 
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Comparing the data here to that of Experiment 1, we can see that the overall level 
of performance is significantly higher in this case. There are at least three possible 
reasons for this. First, because we are using vertical axis rotation and orthographic 
projection here, there is no coupling between the vertical positions of the points in the 
image and their position in depth. This allowed us to construct configurations that 
were more compact in the vertical direction. The relative 3-D structure is easier to 
judge when the projected points are closer to one another. Second, the accuracy of the 
3-D structure derived from structure-from-motion algorithms typically degrades as 
the axis of rotation is slanted further away from the image plane. This occurs because, 
in general, the amount of relative motion between points that is due to their relative 
depths decreases as the axis of rotation is slanted further away from the image plane. If 
the rotation axis is slanted by 90°, so that it is now perpendicular to the image plane, 
there is no relative movement due to relative depths. Computer simulations with the 
incremental rigidity scheme (Hildreth and Grzywacz, unpublished observations) show 
a steady decline in the accuracy of computed 3-D structure as the angle of slant of the 
rotation axis is increased from 0° to 90°. Loomis and Eby (1988, 1989) recently showed 
that the human visual system also exhibits this behavior, which could contribute to a 
drop in performance in Experiment 1, where the axis of rotation is slanted 10° from 
the image plane. 

A third reason for the improved performance in Experiment 2 may be the intro- 
duction of the potential 2-D velocity cue into the experimental setup. The point that 
is midway in depth also has a velocity that is between that of the other two points. 
Subjects were instructed to base their judgements on perceived 3-D positions of the 
points and subjectively reported doing so, but may have inadvertently used the 2-D 
velocity cue, at least on some trials. We believe, however, that subjects were not mak- 
ing sole use of the 2-D velocity cue, for the following reasons. First, when we began 
these experiments, we were not aware of this potential 2-D cue, and found later that if 
we explicitly try to use this cue, we perform substantially better. Second, psychophys- 
ical studies indicate that the size of the temporal integration window for measuring 
image velocities ranges between 80 msec for high velocities to about 200 msec for 
medium range velocities (see, for example, McKee & Welch, 1985). Integration times 
for measuring very slow velocities may be longer (S. McKee, personal communication), 
but such low velocities rarely occur in the visual stimuli used here. The buildup in 
accuracy of perceived 2-D image velocities therefore cannot account for the more ex- 
tended temporal buildup in performance that we see here, which lasts on the order of 
800 - 1000 msec. If the subjects based their judgements on 2-D velocities directly, 
one might expect performance to reach a plateau at only about 200 msec. Third, the 
consistency of the time course measured here with that measured in Experiment 1, 
where the direct 2-D velocity cue was not available, also suggests that the buildup 
of accuracy largely reflects the increased accuracy in perceived 3-D structure rather 
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than perceived image velocities. In particular, the experiments indicate a consistent 
plateau in performance level at 30° for ECH and 45° for NMG. 

EXPERIMENT 3 

It was noted earlier that a common limitation of many structure-from-motion 
algorithms is extreme sensitivity to noise in the visual image. The incremental rigidity 
scheme, however, is quite robust against noise, in part because it integrates visual 
information over an extended time period and also because it allows deviations from 
rigidity. 

Experiment 3 examines the nature of the degradation in human performance on 
the same ordinal task described earlier, as a function of the amount of noise introduced 
in the stimulus. In the section on computer simulations, we compare the psychophys- 
ical data with the behavior of the incremental rigidity scheme. 

The Visual Stimuli and Experimental Procedure 

The experimental procedure used here was similar to that used in Experiment 
2. The visual stimuli differed in the following way. In each discrete frame, Gaussian 
distributed noise was added to the X and Y positions of the points in the projected 
image. The space constant a for the Gaussian was held constant throughout a single 
experimental session, and varied between sessions. The levels of noise used here were 
sufficiently large that the erratic motion of the points was very apparent. 

Experimental Results 

Individual data for two subjects, ECH and NMG, are shown in Figures 4a and 4b, 
respectively. Only a single displacement in depth, 7 = 40, was used in this experiment. 
The subject ECH performed the experiment with added Gaussian noise for which 
a = 2.0 and 4.0 (expressed in terms of visual angle, a = 5' and 10' of visual arc). The 
subject NMG performed the experiment with a = 2.0, 4.0 and 6.0. 

For the subject ECH (data shown in Figure 4a), the added noise uniformly de- 
graded performance for all of the angular extents of rotation. The drop in performance 
for smaller rotations was not significantly greater than that for larger rotations. Sub- 
jectively, the task appeared much more difficult for the larger level of noise, but the 
difference in mean performance between the two noise levels was small. The effect of 
the noise was qualitatively similar for the subject NMG (data shown in Figure 4b), 
although the decrease in performance for a = 2.0 and 4.0 was smaller than that seen 
in the data for the subject ECH. For each level of noise, a plateau in performance was 
reached after 45° or 60° of rotation. Data collected for a third subject was essentially 
the same as that shown in Figure 4a. 
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Figure 4. Results of the noise experiments. The percentage of correct responses 
is plotted against the total angular rotation. The different curves correspond to 
different levels of added Gaussian noise; a — 0.0 (circles), a = 2.0 (squares), a = 4.0 
(triangles) and a = 6.0 (diamonds). 7 = 40 for all experimental sessions, (a) Data 
for subject ECH. (b) Data for subject NMG. There is a gradual degradation in 
performance with increased levels of noise. 
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The overall data for the subject NMG indicates less sensitivity to noise than that 
demonstrated for the subject ECH. Recall that in the first two experiments, NMG 
showed a slower buildup in the accuracy of his internal model of the 3-D structure 
of the points. This slower buildup may suggest a longer integration time for recover- 
ing structure, which is likely to yield the lower sensitivity to noise exhibited in this 
experiment. 

It is significant that performance at this task does not entirely break down with 
the large levels of noise used here, that is, does not drop quickly to a chance level 
of performance. The degradation for small angular extents of motion is expected, as 
the added noise sometimes makes the displacements of the points totally incorrect, 
given their true 3-D structure, for all or most of their extent of motion. Subjectively, 
it was observed that for the large angular extents, the "average" displacement of the 
points over their full trajectory can still be judged and used to interpret their rough 
3-D structure. The data here suggests that the human visual system may not rely on 
precise measurements of the velocities and accelerations of image features, but rather 
may require only rough estimates of the positions or motions of image features, perhaps 
over an extended time period. 

EXPERIMENT 4 

It was noted earlier that a second salient feature of the incremental rigidity scheme 
is the dependence of the current 3-D model on past 3-D models. This is different from 
other structure-from-motion algorithms, in which the computation of 3-D structure 
at a particular moment in time depends only on the visual input measured over some 
limited time frame. This last experiment attempts to test whether perceived 3-D 
structure depends on previous 3-D structures derived by the visual system. 

Early observations from computer simulations with the incremental rigidity scheme 
indicated that the algorithm sometimes behaves differently when started with differ- 
ent initial models of the 3-D structure of a set of points (Hildreth and Grzywacz, 
unpublished observations). In general, if the algorithm begins with an initial model 
that is compressed relative to the true structure of the object, then the algorithm will 
converge toward the true structure. If, on the other hand, the algorithm begins with 
a model that is very stretched (say, by a factor of two) relative to the true structure, 
then the algorithm often settles into a stretched and slightly nonrigid structure, rather 
than converging toward the true, rigid structure. In this experiment, relative move- 
ment was used to establish an initial perception of 3-D structure that was different 
from the structure on which the observer was to be tested. We examined the influence 
of different initial 3-D models on the subsequent perception of structure. 

Much of the experimental setup here was similar to that used in Experiment 2. For 
each trial, however, the motion of the three-point configurations viewed in Experiment 
2 was now immediately preceded by the motion of a different configuration of three 
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points. The last view of the preceding configuration was arranged to coincide with 
the first view of the configuration on which the observer was tested, so that there was 
no abrupt transition between the motion of the first and second configurations. The 
experiment examined three different types of motion preceding the movement of the 
test configurations: 

(1) In the first condition, the preceding configurations were both stretched relative 
to the test configurations and the ordering of the points in depth was randomized. 
The ratio between the stretched and true depths was, on average, between 4:1 
and 5:1. Both the stretched and test configurations were rotated around a central 
vertical axis. 

(2) In the second condition, the preceding configuration was such that at the 
transition point at which the motions of the two configurations join, the preceding 
configuration was flat and in the plane of the computer display. Both the planar 
and test configurations of points were rotated around a central vertical axis. 

(3) In the third condition, the initial configuration was identical to the test 
configuration, but was rotated around the line of sight. (This rotation should not 
convey any information about the real 3-D structure of the points.) After being 
rotated around the line of sight, the configuration was then rotated around the 
vertical axis as before. 

Based on observations from computer simulations, we expect that if the incremen- 
tal rigidity scheme is an appropriate model for the human recovery of structure from 
motion, then the first manipulation above should lead to a substantial degradation in 
the subjects' ability to judge accurately the structure of the test configuration. This 
is because we are initially priming the subject to see a stretched 3-D configuration, in 
which the ordering of the points may be different from that of the test configuration. 
Suppose that the initial stretched configuration is used by the visual system as an ex- 
plicit source of constraint on the subsequent recovery of the 3-D structure of the test 
pattern. The test pattern might then be forced to look somewhat stretched and with 
an incorrect ordering of the points in depth relative to the true structure. If this were 
the case, one would expect a degradation in the quality of the 3-D structure attributed 
to the test pattern. We might not expect perceived structure to remain incorrect in- 
definitely, however. Internal noise, for example, might initiate changes in perceived 
structure. In addition, observations by Adelson (1985) suggest that compact views of 
a rotating 3-D object tend to be interpreted as the projection of a compact object 
in 3-D space, rather than a stretched object viewed from an unusual angle. Thus, 
an object that cycles between stretched and compact 2-D views typically appears to 
distort continuously. This phenomenon is also likely to occur in our displays. 

The second and third manipulations, on the other hand, should not lead to sub- 
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stantial degradation in performance. This expectation is based on the observation that 
in general, the incremental rigidity scheme will converge quickly to the true structure 
of a rotating object, if it begins with an initial model that is compressed in depth, 
relative to the true structure. The algorithm assumes that in the absence of other 3-D 
cues, the initial model is flat and parallel to the image plane. The second and third 
manipulations described above only serve to strengthen the perception of a fiat initial 
3-D structure, prior to the motion of the test configuration. 

The Visual Stimuli and Experimental Procedure 

For each condition, experimental sessions were run with and without the previous 
configuration. In the case of condition (1), the stretched points were rotated around 
the central vertical axis for 36° in increments of 1°. The stretched pattern was then 
immediately followed by the rotating test configurations, which were constructed as 
described in Experiment 2. For condition (2), the flat configuration of points was 
rotated around the central vertical axis for 45° in increments of 1.5°, and then followed 
immediately by the test configuration. For condition (3), the configurations of points 
were initially rotated by 90° around the line of sight in increments of 1.5° before being 
rotated around the central vertical axis. For all three conditions, the test configurations 
were rotated only for small total angular extents (6°, 15°, 30° and 45° of rotation), in 
increments of 1.5° of rotation per frame. The subjects ran two sessions, each containing 
256 trials, and the percentages of correct responses were calculated. 

Experimental Results 

Average data for the subjects NMG, ECH and VKI are shown in Figure 5 for 
condition (1), in which the initial configuration is stretched in depth. Each data point 
represents the results of 384 trials. We show both the data for the control experiment 
(circles), in which only the test configuration appeared, and the data for the case in 
which the stretched configuration appeared first (triangles). The vertical bars indicate 
standard errors. 

The presence of the initial stretched configuration degrades the subsequent com- 
putation of the structure of the test configuration. The drop in performance was large 
for 6° and 15° of rotation, but there was no significant drop in performance for 30° 
and 45° of rotation. In terms of the total time over which performance was affected, 
individual data indicated that this influence extended for about 500 — 600 msec for the 
subjects ECH and VKI, while for the subject NMG, an influence could still be seen 
after about 1200 msec. 

Figures 6 and 7 show average data for the subjects ECH and NMG, for the two 
control conditions, in which the preceding configuration was flat and rotated around 
the vertical axis, or rotated around the line of sight. The data points each represent 
the results of 256 trials. In the case of Figure 6, there is a small, but significant, drop 
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Figure 5. Average data for the experiment with an initial stretched structure, 
for subjects ECH, NMG and VKI. Circles correspond to the control condition and 
triangles to the case where the stretched configuration appeared first. Vertical bars 
indicate standard errors. The initial stretched configuration leads to degradation 
in performance that lasts up to about 30° of rotation of the test configuration. 



in performance for the smallest angular rotation, but there are otherwise no significant 
differences between the data for the control and test conditions. 

Some of the degradation in performance seen in this experiment can be attributed 
to the size of the temporal integration period that is used to measure retinal image 
motion. It appears, however, that this integration period could only account for the 
degradation seen for the smallest angular extent of rotation, 6°, which extended over a 
viewing time of roughly 150 msec. As we noted earlier, psychophysical studies indicate 
that the size of the temporal integration window for measuring image velocities ranges 
between 80 msec for high velocities to about 200 msec for medium range velocities (see, 
for example, McKee & Welch, 1985). We therefore do not believe that the extended 
influence seen here (up to a second or so) can be accounted for on the basis of the 
mechanisms by which retinal image motion is first measured. In further support of 
these conclusions, the second and third manipulations used to generate the preceding 
configurations (data shown in Figures 6 and 7) should also influence the performance 
of the motion measurement stage, but did not lead to an extended influence on the 
quality of the perceived 3-D structure. 

For the case of the incremental rigidity scheme, the structure computed at a 
particular moment appears to depend on previous 3-D models. Based on computer 
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Figure 6. Average data for the experiment in which an initial planar configura- 
tion is presented, for subjects ECH and NMG. Circles correspond to the control 
condition and triangles to the case where the planar configuration appeared first. 
Vertical bars indicate standard errors. There is degradation in performance only 
for the smallest angular rotation, in contrast to the results obtained for the initial 
stretched configuration shown in Figure 5. 



simulations with this model, we expected that if observers initially viewed a stretched 
configuration, then there would be a significant degradation in the quality of the 
structure perceived at later times. On the other hand, if a flat configuration were 
viewed initially, we expected no degradation at later times (except for the shortest 
angular extent of rotation). Our experimental data showed these expectations to hold 
true for a limited time frame of a second or so. When the initial configuration was 
stretched, however, the perceived 3-D structure eventually 'collapsed' to the true, 
more compact structure, while the algorithm typically remains in the stretched 3- 
D interpretation indefinitely. Thus our experiments lend support to the notion that 
previous perceptions of 3-D structure constrain future models, but suggest that some 
modification is required to the incremental rigidity scheme to account fully for human 
behavior. Note that our perceptual experience is consistent with the experimental 
observations by Adelson (1985) mentioned earlier, which suggest that even for rigid 
objects, there is a tendency to perceive objects that are compact in the image as being 
compact in 3-D space, which can lead to nonrigid perceptions. 
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Figure 7. Average data for the experiment in which the test configuration is 
initially rotated around the line of sight, for subjects ECH and NMG. Circles cor- 
respond to the control condition and triangles to the case where the configuration 
was first rotated around the line of sight. Vertical bars indicate standard errors. 
There are no significant drops in performance, in contrast to the results obtained 
for the initial stretched configuration shown in Figure 5. 
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COMPUTER SIMULATIONS 



Through computer simulations, we examined the quantitative behavior of three 
different formulations of the incremental rigidity scheme at a task similar to that used 
in the psychophysical experiments. We refer to the first two formulations as Ullman's 
discrete model and flexible model, and to the third formulation as the continuous model. 
We first describe the three basic algorithms and then present the results of computer 
simulations. 

Ullman's Discrete Model 

Ullman's discrete formulation of the incremental rigidity scheme assumes the vi- 
sual input to consist of a sequence of frames, each containing a number of discrete 
points that may correspond to identifiable features in the changing image. The scheme 
maintains and updates an internal model M(t) of the viewed objects, which consists 
of a set of 3-D coordinates: M(t) = (xi(t), yi(t), Zi(t)). All of the formulations used in 
the simulations here assume orthographic projection onto the X — Y image plane, so 
that (xj(<), Vi{t)) are the image coordinates of the z'-th point, and z,-(t) is the current 
estimate of the depth at the i-th point (see Grzywacz & Hildreth (1987) for formula- 
tions that use perspective projection). When no other 3-D cues are present, the initial 
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model M(t) at t = is taken to be flat; that is, Zi(0) = (or some other constant 
value) for i = 1, . . . , n, where n is the number of points in motion. 

Given a current model M(t) at time t and the image of the moving points in 
a new frame at a later time t', the problem is to compute a new model M(t') such 
that the transformation from M(t) to M(t') is as rigid as possible. Since Xi(t') and 
yi(t') are known, this requires the computation of the unknown depth values Zi(t'). 
(It is assumed that the correspondence between points in the two successive frames is 
known.) The new depth values are computed as follows. Let hj(t) denote the distance 
between points i and j at time t. To make the transformation as rigid as possible, the 
values Zi(t') for the new model are chosen so as to make hj(i) and hj(t') as similar as 
possible. For this purpose, Ullman defined a measure of the difference between hj(t) 
and hj(t') as: 

and formulated the recovery of structure as the computation of Zi(t') that minimize 
the following overall deviation from rigidity: 

D d (t,t') = Y J d(h j (t),l ij (t')). (2) 

After the values Zi(t') have been determined using this minimization process, the new 
model M(t') = (xi(t'),yi(t'),Zi(t')) becomes the current model. A new frame is then 
registered and the process repeats itself. In this way, the scheme maintains rigidity by 
keeping the total distances between points in the model as constant as possible. The 
motivation for the cubic factor in the denominator of Equation (1) is that the nearest 
neighbors to a given point are more likely to belong to the same object than distant 
neighbors, so that a point is more likely to move rigidly with its nearest neighbors. 
The Zf ■(<) factor diminishes the influence of distant points in the recovery of structure. 
It should be noted that in the case of orthographic projection, only relative depth 
values, Z{(t) — Zj(t), can be recovered, rather than absolute depth values, because 
under this form of projection, the image of a given object does not change with its 
absolute depth. In addition, 3-D structure is determined only up to a reflection about 
the image plane, since the orthographic projection of a rotating object, and its mirror 
image rotating in the opposite direction, coincide. Further analysis and variations of 
this discrete model can be found in Grzywacz and Hildreth (1987). 

Ullman's Flexible Model 

The flexible model is a modification of the discrete model that allows the internal 
model at two consecutive instants to be corrected simultaneously. The scheme searches 
for a modified, corrected model M'(t) such that the transition from M(t) to M'(i) is 
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small, and the transition from M'(t) to M(t') (the model at time t') is as rigid as 
possible. The flexible model minimizes the sum: 

Df(t,t') = £ [d(/y(0,'ii(0) + <'W*),'o-(*'))] (3) 

where /;,(<) refers to the distances between pairs of features in the model M'(t). 



The Continuous Model 

It is also possible to develop a continuous formulation of the incremental rigidity 
scheme, which uses velocity information at discrete feature points in a continuously 
changing image as input to the recovery of 3-D structure (Grzywacz & Hildreth, 1987). 
We assume again that there always exists an internal model M{i) = (xi(t), yi(t), Zi(t)), 
and that the image velocities £;(*) and y,(<) are known. The problem is then formu- 
lated as the computation of the z components of velocity, i,(<), that minimize the 
total continuous change in the distances between the points. The measure of overall 
deviation from rigidity is given by: 

£cW = Ert;«) 2 ( 4 ) 

where hj(t) denotes the time derivative of the distances /,-_,-(*), which is dependent 
on the velocities (x,-(t), yi(t),£i(t)). The additional factor of fy that appears in the 
denomonator of Equation (1) could also be used in the measure shown here. In other 
respects, the continuous model is similar to Ullman's discrete model. A model of 
the structure of the moving points is built up by continually taking into account 
new velocity information over an extended time period. Again, because orthographic 
projection is used, only relative velocities, Zi{t) — Zj(t), can be recovered. Further 
details of the continuous model can be found in Grzywacz and Hildreth (1987). 

Simulation Results 

This section describes the performance of the three different formulations of the 
incremental rigidity scheme on a task similar to that used in the psychophysical ex- 
periments and compares these simulation results with human performance. Our main 
conclusion is that the qualitative behaviors of the original discrete model and the con- 
tinuous model differ significantly from human performance, but the behavior of the 
flexible model is qualitatively similar to the psychophysical data, at least for the initial 
rise in performance level for smaller angular rotations. 

Configurations of three points were chosen in a way that was similar to the vi- 
sual stimuli used in the psychophysical experiments, with the positions of the points 
distributed evenly in depth in the final frame. For each of the three models, the con- 
figurations were rotated around the vertical axis for different total angular extents 
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and images of the points were computed at discrete positions in the trajectory. The 
simulations of the discrete and flexible models used angular rotations between frames 
of 10° and a new 3-D model was computed after every 10° of rotation. It has been 
shown that the performance of the incremental rigidity scheme degrades as the an- 
gle of rotation between frames becomes small (Ullman, 1984; Grzywacz & Hildreth, 
1987; Landy, 1987). A rotation of 10° was chosen so that we could obtain results for 
a number of different angular extents in the range from 10° to 180°, without caus- 
ing significant degradation in the performance of the algorithm. Simulations of the 
continuous model used angular rotations between frames of 0.1°, so that a new 3-D 
model was computed every 0.1°. In some simulations, Gaussian distributed noise was 
added to the positions of the points in each frame. At the end of a given sequence of 
images, when the frame that contains the points that are equally spaced in depth was 
reached, it was determined whether the point that occurred in the middle (in depth) 
in the computed 3-D model was the correct middle point. 

In the first set of simulations, the displacements in depth was 7 = 40. For the 
simulations of the discrete and continuous models, no noise was added to the positions 
of the moving points. For the simulations of the flexible model, a small amount of 
Gaussian noise was added, corresponding to a = 2.0. Figure 8 shows the percentage of 
correct responses obtained in the computer simulations, for each of the three models. 
The results for the discrete and continuous formulations are shown with circles and 
triangles, respectively, and the results of the flexible model are shown as squares. Each 
datapoint represents the results of 256 trials. A chance level of performance (33%) was 
assumed for a rotation of 0°, because the algorithm begins with a flat configuration in 
which the Z coordinates of the three points are the same. Also shown in the figure is 
a plot of the psychophysical data from Experiment 2 (crosses)) obtained for 7 = 40, 
averaged over the three subjects (from Figure 3c). 

Some of the degradation in performance seen in this experiment can be attributed 
to the size of the temporal integration period that is used to 

Consider first the behavior of Ullman's discrete model. In the absence of error 
in the positions of the points, the discrete model eventually converges to a perfect 
3-D model and does not reach a plateau at the lower levels of performance seen in 
the psychophysical experiments. In addition, this model exhibits a slower rise in 
performance for smaller angular rotations, in comparison with the human data. Thus 
the discrete model appears not to perform as well as human subjects for small angular 
rotations, but eventually reaches a significantly better level of performance for larger 
extents of rotation. We found that a higher level of performance is reached, even 
for large amounts of added noise in the visual input. The added noise also degrades 
performance for smaller extents of rotation. 

It is interesting to note that early models proposed for recovering structure from 
motion were not considered viable models for the human recovery of structure, due to 
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Figure 8. Results of computer simulations. Results of the three models of the 
incremental rigidity scheme applied to visual stimuli similar to those used in Ex- 
periment 2. The graphs correspond to the results of the discrete (circles) and 
continuous (triangles) formulations, the flexible model (squares) and psychophysi- 
cal data (crosses). The percentage of correct responses is plotted as a function of 
the angular extent of rotation. Human performance exhibits a more rapid early 
buildup in accuracy than the models, but the models continue to improve in perfor- 
mance after the psychophysical data reaches a plateau. The flexible model yields 
the best fit to the experimental data. 



their extreme sensitivity to noise in the visual input. The use of a more flexible rigidity 
constraint, together with the notion of building up a structure incrementally over time, 
as proposed by Ullman (1984), has led to an algorithm that can perform better than 
the human visual system in some circumstances. In order for the discrete formulation 
of the incremental rigidity scheme to remain viable as a model of human performance, 
it needs to be modified in a way that yields both better short-term performance and 
worse performance over extended times. It is possible that some of the differences in 
observed performance are the consequence of properties of the motion measurement 
mechanisms preceding the recovery of 3-D structure, which determine the precision 
of the input position or velocity measurements. Also, these differences may arise from 
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properties of the way in which the internal model is accessed and the deterioration of 
its memory over time. 

Consider now the continuous model. The study by Grzywacz and Hildreth (1987) 
showed that the continuous model can provide a good estimate of structure over a 
short period of time, but then oscillates between good and poor models of structure 
over an extended time period. It does not yield as stable a long-term recovery of 
structure as that provided by Ullman's discrete model. From the results shown in 
Figure 8, it can be seen that in the absence of image noise, the continuous model 
also reaches an almost perfect level of performance that is significantly higher than 
the performance levels reached by human subjects. We expect that for larger angular 
extents of rotation, the performance of the continuous model will drop, because of its 
oscillatory behavior. The time course of the early buildup of structure is similar to 
that observed with the discrete model. Simulations showed the continuous model also 
to be quite robust against noise, but is not as robust as the discrete model. From 
these simulation results, we conclude that the quantitative behavior of the continuous 
model also does not appear to agree well with that of human subjects. Similar to 
the discrete model, it exhibits a slower rise for small angular rotations and eventually 
reaches a higher level of performance. 

The final model that we consider is Ullman's flexible model. In Figure 8, the re- 
sults for this model are closer to the psychophysical data for angular extents of rotation 
up to 90°, although human performance is still somewhat better. The flexible model 
generally builds up a 3-D structure more quickly than the discrete and continuous 
models. Like the other models, however, it eventually reaches an essentially perfect 
level of performance for long extents of motion. 

We also examined the behavior of the flexible model when a large amount of 
Gaussian noise is added to the positions of the points in each image frame. Our 
motivation here is simply to show that similar to human performance, there is a 
graceful degradation in the behavior of the incremental rigidity scheme with increased 
noise. Figure 9a shows the performance of the flexible model when Gaussian noise, 
for which a — 8.0, is added to the image frames (squares; shown in comparison to 
the case in which only a small level of noise is added (circles)). For angular extents 
less than 90°, there is a drop in performance with the larger amount of added noise, 
but the algorithm eventually performs at an almost perfect level in both cases. It is 
difficult to compare, quantitatively, the effects of noise here with the effects observed in 
the psychophysical experiments, because the nature of the noise in the psychophysical 
experiments and simulations is different. The simulations use discrete frames at every 
10° of rotation, while in the experiments, noise is added to frames that are generated 
for every 1.5° of rotation. The temporal smoothing that takes place in the early stages 
of human vision will also tend to smooth out some of the added noise. 

Figure 9b shows the results of computer simulations with the flexible model, for 



28 



(a) 
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Figure 9. Simulations with the flexible model, (a) The results of the flexible 
model applied to the visual stimuli used in Experiment 2, with added Gaussian 
noise in the positions of points in the image frames (<r = 2.0 (circles) and a = 8.0 
(squares)). Performance degrades gracefully with additional noise, (b) The results 
of the flexible model for different displacements in depth, 7 = 40 (circles) and 20 
(squares). Performance is worse for the smaller displacement. 
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two different displacements in depth, 7 = 40 and 20 (shown with circles and squares, 
respectively). Similar to human behavior, there is an overall drop in performance for 
the smaller separation in depth. 

Overall, we observe qualitative similarities in performance between the flexible 
model and human subjects for smaller angular rotations (up to 90° or so), but all of 
the models outperform human subjects for longer rotations. This may be due in part 
to the difficulty that human observers experienced in maintaining the perception of 
a rigid configuration of points for long viewing times. It is also possible that human 
observers are easily confused when the ordering of the points in depth changes during 
the long angular rotations. If the points were presented in a way that strengthened 
their apparent rigidity (for example, they were connected with solid lines), there might 
be an improvement in human performance for longer viewing times. 

The key difference between the flexible model and the other two models may be 
the nature of the updating strategy used at each moment. Allowing the current repre- 
sentation of 3-D structure to change leads to a more rapid early buildup of structure. 
Note that this updating strategy could be incorporated into either a position-based 
or velocity-based algorithm. 

GENERAL DISCUSSION 

This paper presented a series of experiments that assess the accuracy of perceived 
structure, its sensitivity to noise in the visual image, and the nature of its buildup 
over time. Our main conclusions are the following. First, the human visual system 
can derive quite an accurate model of the relative depths of isolated moving points, 
even in the presence of noise in their image positions. Second, the accuracy of the 3-D 
model improves with time, eventually reaching a plateau, beyond which there is no 
further improvement. Third, there is some evidence that the 3-D structure currently 
perceived depends on previous 3-D models. 

The issues of the time course of the buildup of accuracy of perceived 3-D structure 
and the possible dependence of the currently perceived structure on past 3-D models 
were specifically motivated by Ullman's incremental rigidity scheme. It is expected that 
there will be some temporal buildup in accuracy, due to the extended temporal window 
over which image motion is first measured. The extent of this temporal window, 
however, is typically on the order of 80— 100 msec (for example, McKee & Welch, 1985). 
Ullman's model proposes that the recovery of structure itself takes place incrementally 
over a longer time frame. This possibility is supported by our experiments. Subjects 
showed a buildup in accuracy of perceived structure over a second or so, with some 
variation between subjects. In quantitative terms, the early time course is similar to 
that expected by Ullman's flexible model. We also found evidence in our experiments 
suggesting that the currently perceived structure does depend on past models, although 
the temporal extent of this effect also may be limited to a second or so. 



30 

A limitation of many computational models has been an extreme sensitivity to 
noise in the visual input. Some models have attempted to overcome this sensitiv- 
ity by integrating motion measurements at a single moment, but over large spatial 
areas (for example, Bruss & Horn, 1983; Lawton, 1983; Adiv, 1985; Negahdaripour 
& Horn, 1985; Ullman, 1984; Waxman & Wohn, 1988), while others overcome this 
problem through integration of motion measurements over time (Ullman, 1984; Bolles 
& Baker, 1985; Bharwani et al., 1986; Shariat & Price, 1986; Landy, 1987; Bhanu & 
Burger, 1988). The results of our experiment with added noise in the visual stimu- 
lus suggests that the human system can derive a rough estimate of structure in the 
presence of large amounts of noise, even when viewing only three points in motion. 
We may integrate motion information over large spatial regions for some tasks, such 
as the recovery of observer motion, but an extensive spatial integration by itself can- 
not account for our experimental observations. This suggests that the integration of 
motion measurements over time, which may be coupled with viewing the motion over 
larger spatial extents, may be a more important factor in reducing sensitivity to noise. 
These noise experiments also suggest that the human visual system may not rely on 
precise measurements of the velocities and accelerations of image features, but rather 
may require only rough estimates of the motion of image features, perhaps over an 
extended time period. This observation is consistent with recent studies of L. Vaina 
(personal communication) regarding patients with visual deficits, which indicate that 
patients that lose the ability to make precise velocity discriminations may still be able 
to recover 3-D structure from motion. 

Our experiments couple the total viewing time with the total spatial extent of 
viewed motion. Other perceptual studies indicate that for extended viewing periods, 
during which 3-D objects formed from random dots are allowed to oscillate back and 
forth, the accuracy in perceived 3-D structure increases with the angular extent of 
rotation (for example, Braunstein et al., 1987; Todd et al., 1988; Loomis & Eby, 1988, 
1989). Thus the angular extent of viewed motion is by itself a critical factor in de- 
termining the accuracy of perceived structure. Simulations with Ullman's incremental 
rigidity scheme show that if the spatial extent of rotation of an object is kept constant, 
but its temporal extent is varied by oscillating the object back and forth through mul- 
tiple cycles, then the computed 3-D structure will continue to improve over time 
(Hildreth and Grzywacz, unpublished observations). We conducted a pilot experiment 
in which our configurations of three points were oscillated back and forth, and did not 
find any significant improvement in performance over longer times. Thus it remains 
unclear whether temporal extent of motion is, by itself, a factor in determining the 
accuracy of computed structure. 

While there exists some qualitative similarity between Ullman's incremental rigid- 
ity scheme and the human recovery of 3-D structure from motion, there are also clear 
differences revealed in our experiments. In particular, the somewhat faster early in- 
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crease in human performance, and the flattening off of perfo r m ance at a level that is 
significantly less than perfect poee a chaHengp that opens the way lot tether develop- 
ment of the model. 
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